Assignment 2

Problem 1: Speech Denoising Using Deep Learning

  1. If you took my MLSP class, you may think that you’ve seen this problem. But, it’s actually somewhat different from what you did before, so read carefully. And, this time you SHOULD implement a DNN with at least two hidden layers. So, don’t reuse your legacy MATLAB code for this problem.

  2. When you attended IUB, you took a course taught by Prof. K. Since you really liked his lectures, you decided to record them without the professor’s permission. You felt awkward, but you did it anyway because you really wanted to review his lectures later.

  3. Although you meant to review the lecture every time, it turned out that you never listened to it. After graduation, you realized that a lot of concepts you face at work were actually covered by Prof. K’s class. So, you decided to revisit the lectures and study the materials once again using the recordings.

  4. You should have reviewed your recordings earlier. It turned out that a fellow student who used to sit next to you always ate chips in the middle of the class right beside your microphone. So, Prof. K’s beautiful deep voice was contaminated by the annoying chip eating noise.

  5. But, you vaguely recall that you learned some things about speech denoising and source separation from Prof. K’s class. So, you decided to build a simple deep learning-based speech denoiser that takes a noisy speech spectrum (speech plus chip eating noise) and then produces a cleaned-up speech spectrum.

  6. Since you don’t have Prof. K’s clean speech signal, I prepared this male speech data recorded by other people. train dirty male.wav and train clean male.wav are the noisy speech and its corresponding clean speech you are going to use for training the network. Take a listen to them. Load them and convert them into spectrograms, which are the matrix representation of signals.

Loading Libraries

Loading the data and performing stft

  1. Take their magnitudes by using np.abs() or whatever suitable method, because S and X are complex valued. Let’s call them |S| and |X|.
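In practice `librosa.stft(x, n_fft=1024)` gives you the 513-bin complex spectrogram directly; as a sketch of what that entails (and of taking magnitudes with `np.abs()`), here is a framework-free NumPy version, with a synthetic signal standing in for the .wav files:

```python
import numpy as np

def stft(x, n_fft=1024, hop=512):
    """Minimal STFT sketch: Hann-windowed frames, then rfft along frequency.
    n_fft=1024 yields 1 + n_fft//2 = 513 frequency bins per frame."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)   # complex-valued: 513 x n_frames

x = np.random.randn(16000)   # stands in for a loaded .wav signal
X = stft(x)                  # complex spectrogram X
X_mag = np.abs(X)            # |X|: the nonnegative magnitudes used for training
```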
  1. Train a fully-connected deep neural network. A couple of hidden layers would work, but feel free to try out whatever structure, activation function, initialization scheme you’d like. The input to the network is a column vector of |X| (a 513-dim vector) and the target is its corresponding one in |S|. You may want to do some mini-batching for this. Make use of whatever functions in Tensorflow or Pytorch.
  2. But, remember that your network should predict nonnegative magnitudes as output. Try to use a proper activation function in the last layer to make sure of that. I don’t care which activation function you use in the middle layers.
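To make the nonnegativity constraint concrete, here is a framework-free NumPy sketch of such a forward pass. The 513 → 1024 → 1024 → 513 sizes and the softplus output are illustrative choices, not requirements; in TF or PyTorch you would simply end the model with a `ReLU` or `Softplus` layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softplus(z):
    return np.log1p(np.exp(z))   # smooth and strictly nonnegative

# Toy weights: 513 -> 1024 -> 1024 -> 513 (two hidden layers), illustration only
W1, b1 = 0.01 * rng.standard_normal((513, 1024)), np.zeros(1024)
W2, b2 = 0.01 * rng.standard_normal((1024, 1024)), np.zeros(1024)
W3, b3 = 0.01 * rng.standard_normal((1024, 513)), np.zeros(513)

def forward(x_mag):
    h = relu(x_mag @ W1 + b1)
    h = relu(h @ W2 + b2)
    return softplus(h @ W3 + b3)   # predicted |S|: guaranteed >= 0

batch = rng.random((32, 513))      # a minibatch of |X| spectra, one per row
pred = forward(batch)
```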

Model Architecture

Training the model

  1. test 01 x.wav is the test noisy signal. Load it and apply STFT as before. Feed the magnitude spectra of this test mixture |Xtest| to your network and predict their clean magnitude spectra |Ŝtest|. Then, you can recover the (complex-valued) speech spectrogram of the test signal in this way:

$ \widehat{S}_{test} = \frac{X_{test}}{|X_{test}|}\odot \widehat{|S_{test}|} $

which means you take the phase information of the input noisy signal Xtest (normalized by its magnitude |Xtest|) and use that to recover the clean speech. ⊙ stands for the Hadamard product, and the division is element-wise, too.
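A NumPy sketch of that recombination, with random arrays standing in for the real test spectrogram and the network's output (the small `eps` guarding against zero-magnitude bins is my own addition):

```python
import numpy as np

rng = np.random.default_rng(1)
X_test = rng.standard_normal((513, 100)) + 1j * rng.standard_normal((513, 100))
S_mag_hat = rng.random((513, 100))   # stands in for the predicted |S_test|

eps = 1e-8                           # avoid dividing by zero-magnitude bins
S_hat = X_test / (np.abs(X_test) + eps) * S_mag_hat   # noisy phase, clean magnitude
```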

  1. Recover the time-domain speech signal by applying an inverse STFT to Ŝtest, which will give you a vector. Let’s call this cleaned-up test speech signal ŝtest. I’ll calculate something called Signal-to-Noise Ratio (SNR) by comparing it with the ground-truth speech I didn’t share with you. It should be reasonably good. You can actually write it out by using the following code:
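The assignment's own write-out snippet isn't reproduced here. As a sanity check on your own side, SNR against a known reference can be computed like this (a hedged sketch; the grading script may use a slightly different formula):

```python
import numpy as np

def snr_db(s, s_hat):
    """10 * log10( signal power / residual-noise power ), in dB."""
    noise = s - s_hat
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum(noise ** 2))

# Example: an estimate at 90% of the true amplitude yields 20 dB SNR
s = np.ones(10)
val = snr_db(s, 0.9 * s)
```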

Test file 1 (test 01 x.wav)

Predicting on the test data

Recovering the signal by doing inverse stft

Recovered signal with noise removed

  1. Do the same testing procedure for test 02 x.wav, which actually contains Prof. K’s voice along with the chip eating noise. Enjoy his enhanced voice using your DNN.

Test file 2 (test 02 x.wav)

Predicting on the test data

Recovered signal

Problem 2: Speech Denoising Using 1D CNN

  1. As an audio guy it’s sad to admit, but a lot of audio signal processing problems can be solved in the time-frequency domain, or an image version of the audio signal. You’ve learned how to do it in the previous homework by using STFT and its inverse process.

  2. What that means is nothing stops you from applying a CNN to the same speech denoising problem. In this question, I’m asking you to implement a 1D CNN that does the speech denoising job in the STFT magnitude domain. 1D CNN here means a variant of CNN which does the convolution operation along only one of the axes. In our case it’s the frequency axis.

  3. Like you did in homework 1 Q2, install/load librosa. Take the magnitude spectrograms of the dirty signal and the clean signal |X| and |S|.

  4. Both in Tensorflow and PyTorch, you’d better transpose this matrix so that each row of the matrix is a spectrum. Your 1D CNN will take one of these row vectors as an example, i.e. |X⊤:,i|. Since this is not an RGB image with three channels, nor will you use any other information than just the magnitude during training, your input image has only one channel (depth-wise). Coupled with your choice of the minibatch size, the dimensionality of your minibatch would be like this: [(batch size) × (number of channels) × (height) × (width)] = [B × 1 × 1 × 513]. Note that depending on the implementation of the 1D CNN layers in TF or PT, it’s okay to omit the height information. Carefully read the definition of the function you’ll use.
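A NumPy illustration of that layout (the 2,459-frame count is the training spectrogram size quoted later in this assignment; PyTorch's `Conv1d`, for instance, omits the height axis and expects `[B, channels, width]`):

```python
import numpy as np

X_mag = np.random.rand(513, 2459)   # |X|: frequencies x frames
X_T = X_mag.T                       # transposed: each row is a 513-dim spectrum

B = 64
batch = X_T[:B]                     # B example rows |X^T_{:,i}|
batch = batch[:, np.newaxis, :]     # add the single channel axis -> [B, 1, 513]
```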

  5. You’ll also need to define the size of the kernel, which will be 1 × D, or simply D depending on the implementation (because we know that there’s no convolution along the height axis).

  1. If you define K kernels in the first layer, the output feature map’s dimension will be [B × K × 1 × (513 − D + 1)]. You don’t need too many kernels, but feel free to investigate. You don’t need too many hidden layers, either.

  2. In the end, you know, you have to produce an output matrix of [B × 513], which are the approximation of the clean magnitude spectra of the batch. It’s a dimension hard to match using CNN only, unless you take care of the edges by padding zeros (let’s not do zero-padding for this homework). Hence, you may want to flatten the last feature map as a vector, and add a regular linear layer to reduce that dimensionality down to 513.

  1. Meanwhile, although this flattening-followed-by-linear-layer approach should work in theory, the dimensionality of your flattened CNN feature map might be too large. To handle this issue, we will use a concept we learned in class, striding: usually, a stride larger than 1 can reduce the dimensionality after each CNN layer. You could consider this option in all convolutional layers to reduce the size of the feature maps gradually, so that the input dimensionality of the last fully-connected (FC) layer is manageable. Maxpooling, coupled with the striding technique, would be something to consider.
  1. Be very careful about this dimensionality, because you have to define the input and output dimensionality of the FC layer in advance. For example, a stride of 2 pixels will reduce the feature dimension down to roughly 50%, though not exactly if the original dimensionality is an odd number.
  1. Don’t forget to apply the activation function of your choice, at every layer, especially in the last layer.
  1. Try whatever optimization techniques you’ve learned so far.
  1. Check on the quality of the test signals you used in P1. Here, once again, you have no good way to judge the quality of the test results other than listening to them.
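Getting the FC layer's input size right is pure arithmetic. The helper below tracks the valid-convolution (no-padding) width through the layers; the kernel size D = 16, the strides, and K = 32 kernels are arbitrary example choices, not prescribed values:

```python
def conv1d_out(width, kernel, stride=1):
    """Output width of a valid (no-padding) 1D convolution."""
    return (width - kernel) // stride + 1

w = conv1d_out(513, 16)           # layer 1: 513 - 16 + 1 = 498
w = conv1d_out(w, 16, stride=2)   # layer 2: stride 2 roughly halves the width
K = 32                            # kernels in the last conv layer (example)
fc_in = K * w                     # flattened size feeding the final FC layer
```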

Loading the data and doing stft

Taking absolute value and reshaping

Model Architecture

Fitting the model

Test file 1 (test 01 x.wav)

Reshaping the test output and doing prediction

Extracting the recovered signal

Recovered signal

Test file 2 (test 02 x.wav)

Reshaping the test output and doing prediction

Extracting the recovered signal

Recovered signal

Problem 3: Speech Denoising Using 2D CNN

  1. Now that we know the audio source separation problem can be solved in the image represen- tation, nothing stops us from using 2D CNN for this.

  2. To this end, let’s define our input “image” properly. You extract an image of 20 × 513 out of the entire STFT magnitude spectrogram (transposed). That’s an input sample. Using this, your 2D CNN takes the 20 consecutive noisy frames into account to estimate the cleaned-up spectrum that corresponds to the last (20th) input frame, i.e. the clean spectrum of the current frame, t + 19: $ |S^\top_{:,t+19}| \approx \mathcal{F}_{CNN}(|X^\top_{:,t:t+19}|) $ (3)

  3. Your next image will be another 20 frames, shifted by one frame: $ |S^\top_{:,t+20}| \approx \mathcal{F}_{CNN}(|X^\top_{:,t+1:t+20}|) $ (4), and so on. Therefore, a pair of adjacent images (unless you shuffle the order) will share 19 overlapping frames. Since your original STFT spectrogram has 2,459 frames, you can create 2,440 such images as your training dataset.
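One way to build those 2,440 overlapping images without copying data is NumPy's `sliding_window_view`; random arrays stand in for the real spectrograms here:

```python
import numpy as np

T, F, ctx = 2459, 513, 20
X_T = np.random.rand(T, F)   # transposed noisy magnitude spectrogram
S_T = np.random.rand(T, F)   # corresponding clean magnitudes

# One 20 x 513 image per start frame t; T - ctx + 1 = 2440 images in total
images = np.lib.stride_tricks.sliding_window_view(X_T, (ctx, F))[:, 0]
images = images[:, np.newaxis]   # add channel axis -> [2440, 1, 20, 513]
targets = S_T[ctx - 1:]          # clean spectrum of each image's last frame
```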

  4. Therefore the input to the 2D CNN will be of [(batch size) × 1 × 20 × 513].

  5. Your 2D CNN kernel should, of course, be larger than 1 along both the width (frequency) and the height (frame) axes. Feel free to investigate different sizes.

  6. Otherwise, the basic idea is similar to the 1D CNN case. You’ll still need those striding/pooling techniques as well as the FC layer.

  7. Report the denoising results in the same way. One thing to note is that your output will contain only 2,440 spectra; it is missing the first 19 frames. You can ignore those first few frames when you calculate the SNR of the training results. A better way is to augment your input X with 19 silent frames (some magnitude spectra with very small random numbers) at the beginning to match the dimension. I recommend the latter approach.
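A sketch of the recommended padding; the 1e-6 scale for the "very small random numbers" is my arbitrary choice:

```python
import numpy as np

T, F, ctx = 2459, 513, 20
X_T = np.random.rand(T, F)                # transposed magnitude spectrogram

pad = 1e-6 * np.random.rand(ctx - 1, F)   # 19 near-silent frames
X_padded = np.vstack([pad, X_T])          # (2478, 513)

# Now every original frame gets a prediction:
n_images = X_padded.shape[0] - ctx + 1    # 2478 - 20 + 1 = 2459
```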

Creating a 3D matrix for training

Model architecture

Fitting the model

Prediction on train data

Prediction on Test file 1 (test 01 x.wav)

Creating a 3D matrix for Test file 1

Prediction

Recovered signal

Prediction on Test file 2 (test 02 x.wav)

Creating a 3D matrix for test file 2

Prediction on test file 2

Recovered signal